Project Report¶
Microsoft Malware Detection : Data preparation and EDA¶
Summary¶
This projects aims to analyze and process the data provided by Microsoft for malware prediction. The project tackles high memory issues first and then moves onto cleaning and exploratory data analysis to prepare the data to fit Machine Learning/ Deep Learning models directly. Project also shares insights gained from the data about the main causes of the malware spread and which kind of devices are more likely to be affected.
Project Flow¶
Data Acquisition: Obtain relevant data from Kaggle competition and download them into CSV files.
Data Source: Original data set is huge and we won't be able to run the whole pipeline on GitHub, that's why a small subset of the data is used which contains only first 100k Rows fromt the original dataset : https://www.kaggle.com/competitions/microsoft-malware-prediction/data
Dataset comprises information about the OS type, Antivirus information, Platform, Touch or Non touch devices etc about the machines and the label in HasDetections which is a binary value describing if a malware was detected on the machine or not.
After downloading the data, csv files are loaded and processed and finally stored in .sqlite and saved. Steps in preprocessing are:
(i). Optimize memory usage by type casting to required dtypes like if the data can be stored in int8 and is using int16 currently then it would reduce memory usage by converting to int8.
(ii) load_and_transform function from pipeline.py, If the proportion (rate) of the most frequently occurring value is greater than 90% (rate > 0.9), it removes that column from the good_cols list.
(iii) This good cols list is also used for test data.
(iv) convert_to_db function converts both train and test data to sqlite db and saves them with respective table names as 'train' and 'test'.
- After this we test our pipeline with some tests to test the functionality of the above features, which can be run using test.sh file.
- Finally we added a CI pipeline which runs all the tests everytime there's an update to the main branch to verify if everything is working or not.
Rest of this report follows data processing, exploration and visualization along with the insights gained from the data and discussing the limitations along with the conclusions in the end.
Libraries¶
Import all necessary libraries like Plotly, Seaborn for visualization and Pandas, Numpy for processing.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import numpy as np
import os
import time
import plotly
import plotly.offline as pyo
import plotly.graph_objs as go
import plotly.tools as tls
pyo.init_notebook_mode()
import gc
from tqdm import tqdm_notebook
import warnings
warnings.filterwarnings("ignore")
Setting max column and row range so that we can see all the columns in data.head()
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 100)
train = pd.read_csv(r"E:\FAU-SEM2\made-classwork\data\train_sub.csv")
test = pd.read_csv(r"E:\FAU-SEM2\made-classwork\data\test_sub.csv")
train.shape, test.shape
((100000, 83), (100000, 82))
train.head()
| MachineIdentifier | ProductName | EngineVersion | AppVersion | AvSigVersion | IsBeta | RtpStateBitfield | IsSxsPassiveMode | DefaultBrowsersIdentifier | AVProductStatesIdentifier | AVProductsInstalled | AVProductsEnabled | HasTpm | CountryIdentifier | CityIdentifier | OrganizationIdentifier | GeoNameIdentifier | LocaleEnglishNameIdentifier | Platform | Processor | OsVer | OsBuild | OsSuite | OsPlatformSubRelease | OsBuildLab | SkuEdition | IsProtected | AutoSampleOptIn | PuaMode | SMode | IeVerIdentifier | SmartScreen | Firewall | UacLuaenable | Census_MDC2FormFactor | Census_DeviceFamily | Census_OEMNameIdentifier | Census_OEMModelIdentifier | Census_ProcessorCoreCount | Census_ProcessorManufacturerIdentifier | Census_ProcessorModelIdentifier | Census_ProcessorClass | Census_PrimaryDiskTotalCapacity | Census_PrimaryDiskTypeName | Census_SystemVolumeTotalCapacity | Census_HasOpticalDiskDrive | Census_TotalPhysicalRAM | Census_ChassisTypeName | Census_InternalPrimaryDiagonalDisplaySizeInInches | Census_InternalPrimaryDisplayResolutionHorizontal | Census_InternalPrimaryDisplayResolutionVertical | Census_PowerPlatformRoleName | Census_InternalBatteryType | Census_InternalBatteryNumberOfCharges | Census_OSVersion | Census_OSArchitecture | Census_OSBranch | Census_OSBuildNumber | Census_OSBuildRevision | Census_OSEdition | Census_OSSkuName | Census_OSInstallTypeName | Census_OSInstallLanguageIdentifier | Census_OSUILocaleIdentifier | Census_OSWUAutoUpdateOptionsName | Census_IsPortableOperatingSystem | Census_GenuineStateName | Census_ActivationChannel | Census_IsFlightingInternal | Census_IsFlightsDisabled | Census_FlightRing | Census_ThresholdOptIn | Census_FirmwareManufacturerIdentifier | Census_FirmwareVersionIdentifier | Census_IsSecureBootEnabled | Census_IsWIMBootEnabled | Census_IsVirtualDevice | Census_IsTouchEnabled | Census_IsPenCapable | Census_IsAlwaysOnAlwaysConnectedCapable | Wdft_IsGamer | Wdft_RegionIdentifier | HasDetections | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0000028988387b115f69f31a3bf04f09 | win8defender | 1.1.15100.1 | 4.18.1807.18075 | 1.273.1735.0 | 0 | 7.0000 | 0 | NaN | 53,447.0000 | 1.0000 | 1.0000 | 1 | 29 | 128,035.0000 | 18.0000 | 35.0000 | 171 | windows10 | x64 | 10.0.0.0 | 17134 | 256 | rs4 | 17134.1.amd64fre.rs4_release.180410-1804 | Pro | 1.0000 | 0 | NaN | 0.0000 | 137.0000 | NaN | 1.0000 | 1.0000 | Desktop | Windows.Desktop | 2,668.0000 | 9,124.0000 | 4.0000 | 5.0000 | 2,341.0000 | NaN | 476,940.0000 | HDD | 299,451.0000 | 0 | 4,096.0000 | Desktop | 18.9000 | 1,440.0000 | 900.0000 | Desktop | unknown | 4,294,967,295.0000 | 10.0.17134.165 | amd64 | rs4_release | 17134 | 165 | Professional | PROFESSIONAL | UUPUpgrade | 26.0000 | 119 | UNKNOWN | 0 | IS_GENUINE | Retail | NaN | 0.0000 | Retail | NaN | 628.0000 | 36,144.0000 | 0 | NaN | 0.0000 | 0 | 0 | 0.0000 | 0.0000 | 10.0000 | 0 |
| 1 | 000007535c3f730efa9ea0b7ef1bd645 | win8defender | 1.1.14600.4 | 4.13.17134.1 | 1.263.48.0 | 0 | 7.0000 | 0 | NaN | 53,447.0000 | 1.0000 | 1.0000 | 1 | 93 | 1,482.0000 | 18.0000 | 119.0000 | 64 | windows10 | x64 | 10.0.0.0 | 17134 | 256 | rs4 | 17134.1.amd64fre.rs4_release.180410-1804 | Pro | 1.0000 | 0 | NaN | 0.0000 | 137.0000 | NaN | 1.0000 | 1.0000 | Notebook | Windows.Desktop | 2,668.0000 | 91,656.0000 | 4.0000 | 5.0000 | 2,405.0000 | NaN | 476,940.0000 | HDD | 102,385.0000 | 0 | 4,096.0000 | Notebook | 13.9000 | 1,366.0000 | 768.0000 | Mobile | unknown | 1.0000 | 10.0.17134.1 | amd64 | rs4_release | 17134 | 1 | Professional | PROFESSIONAL | IBSClean | 8.0000 | 31 | UNKNOWN | 0 | OFFLINE | Retail | NaN | 0.0000 | NOT_SET | NaN | 628.0000 | 57,858.0000 | 0 | NaN | 0.0000 | 0 | 0 | 0.0000 | 0.0000 | 8.0000 | 0 |
| 2 | 000007905a28d863f6d0d597892cd692 | win8defender | 1.1.15100.1 | 4.18.1807.18075 | 1.273.1341.0 | 0 | 7.0000 | 0 | NaN | 53,447.0000 | 1.0000 | 1.0000 | 1 | 86 | 153,579.0000 | 18.0000 | 64.0000 | 49 | windows10 | x64 | 10.0.0.0 | 17134 | 768 | rs4 | 17134.1.amd64fre.rs4_release.180410-1804 | Home | 1.0000 | 0 | NaN | 0.0000 | 137.0000 | RequireAdmin | 1.0000 | 1.0000 | Desktop | Windows.Desktop | 4,909.0000 | 317,701.0000 | 4.0000 | 5.0000 | 1,972.0000 | NaN | 114,473.0000 | SSD | 113,907.0000 | 0 | 4,096.0000 | Desktop | 21.5000 | 1,920.0000 | 1,080.0000 | Desktop | unknown | 4,294,967,295.0000 | 10.0.17134.165 | amd64 | rs4_release | 17134 | 165 | Core | CORE | UUPUpgrade | 7.0000 | 30 | FullAuto | 0 | IS_GENUINE | OEM:NONSLP | NaN | 0.0000 | Retail | NaN | 142.0000 | 52,682.0000 | 0 | NaN | 0.0000 | 0 | 0 | 0.0000 | 0.0000 | 3.0000 | 0 |
| 3 | 00000b11598a75ea8ba1beea8459149f | win8defender | 1.1.15100.1 | 4.18.1807.18075 | 1.273.1527.0 | 0 | 7.0000 | 0 | NaN | 53,447.0000 | 1.0000 | 1.0000 | 1 | 88 | 20,710.0000 | NaN | 117.0000 | 115 | windows10 | x64 | 10.0.0.0 | 17134 | 256 | rs4 | 17134.1.amd64fre.rs4_release.180410-1804 | Pro | 1.0000 | 0 | NaN | 0.0000 | 137.0000 | ExistsNotSet | 1.0000 | 1.0000 | Desktop | Windows.Desktop | 1,443.0000 | 275,890.0000 | 4.0000 | 5.0000 | 2,273.0000 | NaN | 238,475.0000 | UNKNOWN | 227,116.0000 | 0 | 4,096.0000 | MiniTower | 18.5000 | 1,366.0000 | 768.0000 | Desktop | unknown | 4,294,967,295.0000 | 10.0.17134.228 | amd64 | rs4_release | 17134 | 228 | Professional | PROFESSIONAL | UUPUpgrade | 17.0000 | 64 | FullAuto | 0 | IS_GENUINE | OEM:NONSLP | NaN | 0.0000 | Retail | NaN | 355.0000 | 20,050.0000 | 0 | NaN | 0.0000 | 0 | 0 | 0.0000 | 0.0000 | 3.0000 | 1 |
| 4 | 000014a5f00daa18e76b81417eeb99fc | win8defender | 1.1.15100.1 | 4.18.1807.18075 | 1.273.1379.0 | 0 | 7.0000 | 0 | NaN | 53,447.0000 | 1.0000 | 1.0000 | 1 | 18 | 37,376.0000 | NaN | 277.0000 | 75 | windows10 | x64 | 10.0.0.0 | 17134 | 768 | rs4 | 17134.1.amd64fre.rs4_release.180410-1804 | Home | 1.0000 | 0 | NaN | 0.0000 | 137.0000 | RequireAdmin | 1.0000 | 1.0000 | Notebook | Windows.Desktop | 1,443.0000 | 331,929.0000 | 4.0000 | 5.0000 | 2,500.0000 | NaN | 476,940.0000 | HDD | 101,900.0000 | 0 | 6,144.0000 | Portable | 14.0000 | 1,366.0000 | 768.0000 | Mobile | lion | 0.0000 | 10.0.17134.191 | amd64 | rs4_release | 17134 | 191 | Core | CORE | Update | 8.0000 | 31 | FullAuto | 0 | IS_GENUINE | Retail | 0.0000 | 0.0000 | Retail | 0.0000 | 355.0000 | 19,844.0000 | 0 | 0.0000 | 0.0000 | 0 | 0 | 0.0000 | 0.0000 | 1.0000 | 1 |
stats = []
for col in train.columns:
stats.append((col, train[col].nunique(), train[col].isnull().sum() * 100 / train.shape[0], train[col].value_counts(normalize=True, dropna=False).values[0] * 100, train[col].dtype))
stats_df = pd.DataFrame(stats, columns=['Feature', 'Unique_values', 'Percentage of missing values', 'Percentage of values in the biggest category', 'type'])
stats_df.sort_values('Percentage of missing values', ascending=False).head(10)
| Feature | Unique_values | Percentage of missing values | Percentage of values in the biggest category | type | |
|---|---|---|---|---|---|
| 28 | PuaMode | 1 | 99.969 | 99.969 | object |
| 41 | Census_ProcessorClass | 3 | 99.573 | 99.573 | object |
| 8 | DefaultBrowsersIdentifier | 288 | 95.110 | 95.110 | float64 |
| 68 | Census_IsFlightingInternal | 1 | 83.163 | 83.163 | float64 |
| 52 | Census_InternalBatteryType | 22 | 71.240 | 71.240 | object |
| 71 | Census_ThresholdOptIn | 2 | 63.738 | 63.738 | float64 |
| 75 | Census_IsWIMBootEnabled | 1 | 63.659 | 63.659 | float64 |
| 31 | SmartScreen | 11 | 35.731 | 48.288 | object |
| 15 | OrganizationIdentifier | 37 | 30.664 | 47.149 | float64 |
| 29 | SMode | 2 | 5.932 | 94.035 | float64 |
Column Observations:¶
High Missing Values:
- Columns
PuaModeand `Census_ProcessorClasare almost useless because almost all of the data is missing, so it would be better to drop themmns.
- Columns
Dominant Category:
- In the
DefaultBrowsersIdentifiercomore than 95% data belong to a single category, which means it is not usablecolumn.
- In the
Potential Analysis:
Census_IsFlightingInternalhas an unusua and would need further deep divenificance.
Imbalanced Columns:
- There are 26 columns where one category contains over 90% of the values. These imbalanced columns should be considered for removal from the dataset.
Numeric vs. Categorical:
- Despite being considered numeric, columns like
CityIdentifiermight be more appropriately treated as categorical. A detailed exploration during EDA will confirm this.
- Despite being considered numeric, columns like
Categorical Columns: p them.
- Except for
Census_SystemVolumeTotalCapacity, all other column's performance.
- Except for
droppable_features = []
droppable_features.append('PuaMode')
droppable_features.append('Census_ProcessorClass')
train['HasDetections'].value_counts()
HasDetections 0 50072 1 49928 Name: count, dtype: int64
Label is balanced which is very good for classification objective.
train['Census_IsTouchEnabled'].value_counts()
Census_IsTouchEnabled 0 87458 1 12542 Name: count, dtype: int64
Most devices are not touch enabled, which is expected as most devices using microsft are windows based and does not have touch functionality.
import plotly.express as px
def plot_categorical_vs_detection(df, column_name):
column_counts = df[column_name].value_counts()
detection_counts = df.groupby([column_name, 'HasDetections']).size().unstack(fill_value=0)
data_for_plotly = detection_counts.reset_index()
melted_data = pd.melt(data_for_plotly, id_vars=[column_name], var_name='HasDetections', value_name='Count')
fig = px.bar(melted_data, x=column_name, y='Count', color='HasDetections',
labels={column_name: column_name, 'Count': 'Count'},
title=f'{column_name} vs HasDetections',
category_orders={'HasDetections': ['0', '1']},
color_discrete_map={'0': '#1f77b4', '1': '#ff7f0e'})
fig.update_layout(width=800, height=600)
fig.show()
plot_categorical_vs_detection(train, 'Census_IsTouchEnabled')
plot_categorical_vs_detection(train, 'Census_PowerPlatformRoleName')
As expected Mobile phones and Desktops are most targeted devices.
sns.set(rc={'figure.figsize':(15, 8)})
sns.countplot(x="SmartScreen", hue="HasDetections", palette="PRGn", data=train)
plt.title("SmartScreen counts")
plt.xticks(rotation='vertical')
plt.show()
plot_categorical_vs_detection(train,'EngineVersion')
Two Engine variants are most popular, would be interesting to check them against touch or not touch devices.
import matplotlib.pyplot as plt
top_engine_versions = ['1.1.15200.1', '1.1.15100.1']
fig, axes = plt.subplots(nrows=1, ncols=len(top_engine_versions), figsize=(12, 5))
for i, version in enumerate(top_engine_versions):
subset_df = train[train['EngineVersion'] == version]
subset_df['Census_IsTouchEnabled'].value_counts().plot(kind='bar', ax=axes[i], color=['#1f77b4', '#ff7f0e'])
axes[i].set_title(f'EngineVersion {version}')
axes[i].set_xlabel('Census_IsTouchEnabled')
axes[i].set_ylabel('Count')
plt.tight_layout()
plt.show()
Both EngineVersion supports fewer number of touch devices.
plot_categorical_vs_detection(train,'AppVersion')
#plot_categorical_vs_detection(train,'AvSigVersion')
AvSigVersion doesn't provide useful information
plot_categorical_vs_detection(train,'AVProductsInstalled')
Interesting thing here is that if a device has antivirus software even then they are more likely to be infected.
plot_categorical_vs_detection(train,'CountryIdentifier')
Most countries have around 50% infection rate. Would be interesting to plot top countries with infection.
country_detection_rates = train.groupby('CountryIdentifier')['HasDetections'].mean().reset_index()
top_countries = country_detection_rates.nlargest(10, 'HasDetections')
fig = px.bar(top_countries, x='CountryIdentifier', y='HasDetections', color='HasDetections',
labels={'CountryIdentifier': 'Country Identifier', 'HasDetections': 'Detection Rate'},
title='Top 10 Countries with Highest Detection Rates',
category_orders={'HasDetections': ['0', '1']},
color_discrete_map={'0': '#1f77b4', '1': '#ff7f0e'})
fig.show()
Intersting, some countries have very high detection rates let's look at the cities.
country_detection_rates = train.groupby('CityIdentifier')['HasDetections'].mean().reset_index()
top_countries = country_detection_rates.nlargest(20, 'HasDetections')
fig = px.bar(top_countries, x='CityIdentifier', y='HasDetections', color='HasDetections',
labels={'CityIdentifier': 'City Identifier', 'HasDetections': 'Detection Rate'},
title='Top 20 Cities with Highest Detection Rates',
category_orders={'HasDetections': ['0', '1']},
color_discrete_map={'0': '#1f77b4', '1': '#ff7f0e'})
fig.show()
plot_categorical_vs_detection(train,'OrganizationIdentifier')
Two organizations cover most of the computers.
top_countries = train['Census_ProcessorCoreCount'].value_counts().head(10).index
subset_df = train[train['Census_ProcessorCoreCount'].isin(top_countries)]
plt.figure(figsize=(10, 6))
subset_df.groupby(['Census_ProcessorCoreCount', 'HasDetections']).size().unstack().plot(kind='bar', stacked=True)
plt.title('Top processor counts with HasDetections')
plt.xlabel('Number of processors in device')
plt.ylabel('Count')
plt.legend(title='HasDetections', loc='upper right', labels=['No Detection', 'Detection'])
plt.show()
<Figure size 1000x600 with 0 Axes>
Most devices have 2,4 or 8 processors and detection rate is similar for all.
top_countries = train['Census_TotalPhysicalRAM'].value_counts().head(10).index
subset_df = train[train['Census_TotalPhysicalRAM'].isin(top_countries)]
plt.figure(figsize=(10, 6))
subset_df.groupby(['Census_TotalPhysicalRAM', 'HasDetections']).size().unstack().plot(kind='bar', stacked=True)
plt.title('Top RAM distributions with HasDetections')
plt.xlabel('RAM(gb) in device')
plt.ylabel('Count')
plt.legend(title='HasDetections', loc='upper right', labels=['No Detection', 'Detection'])
plt.show()
<Figure size 1000x600 with 0 Axes>
sns.countplot(x='Census_ProcessorClass', hue='HasDetections',data=train)
plt.show()
More logical processors seems to have higher infection rate.
sns.countplot(x='ProductName', hue='HasDetections',data=train)
plt.show()
Nothing too interesting as detection rate is almost equal.
sns.countplot(x='Platform', hue='HasDetections',data=train)
plt.show()
Most devices from the data seems to be new as they features latest windows10 (at the time of data collection).
sns.countplot(x='Processor', hue='HasDetections',data=train)
plt.show()
Devices equipped with x64 are targeted more, but we don't have enough data to say anything about arm64 based devices.
sns.countplot(x='SkuEdition', hue='HasDetections',data=train)
plt.show()
Pro devices are targeted more in comparision to home or enterprise based systems.
sns.countplot(x='Census_PrimaryDiskTypeName', hue='HasDetections',data=train)
plt.show()
Devices with HDD are targeted more which makes sense cause SDD devices are compartively new (at the time of the data) so they might have comparatively better security or advanced users.
pd.options.display.float_format = '{:,.4f}'.format
sk_df = pd.DataFrame([{'column': c, 'uniq': train[c].nunique(), 'skewness': train[c].value_counts(normalize=True).values[0] * 100} for c in train.columns])
sk_df = sk_df.sort_values('skewness', ascending=False)
sk_df
| column | uniq | skewness | |
|---|---|---|---|
| 5 | IsBeta | 1 | 100.0000 |
| 28 | PuaMode | 1 | 100.0000 |
| 75 | Census_IsWIMBootEnabled | 1 | 100.0000 |
| 68 | Census_IsFlightingInternal | 1 | 100.0000 |
| 69 | Census_IsFlightsDisabled | 1 | 100.0000 |
| 27 | AutoSampleOptIn | 2 | 99.9970 |
| 29 | SMode | 2 | 99.9649 |
| 71 | Census_ThresholdOptIn | 2 | 99.9641 |
| 65 | Census_IsPortableOperatingSystem | 2 | 99.9230 |
| 35 | Census_DeviceFamily | 2 | 99.8610 |
| 33 | UacLuaenable | 3 | 99.3813 |
| 76 | Census_IsVirtualDevice | 2 | 99.3358 |
| 1 | ProductName | 2 | 98.8800 |
| 12 | HasTpm | 2 | 98.7270 |
| 7 | IsSxsPassiveMode | 2 | 98.2370 |
| 32 | Firewall | 2 | 97.7326 |
| 11 | AVProductsEnabled | 5 | 97.4982 |
| 6 | RtpStateBitfield | 6 | 97.3163 |
| 20 | OsVer | 11 | 96.7510 |
| 18 | Platform | 4 | 96.6190 |
| 78 | Census_IsPenCapable | 2 | 96.2850 |
| 26 | IsProtected | 2 | 94.4864 |
| 79 | Census_IsAlwaysOnAlwaysConnectedCapable | 2 | 94.2485 |
| 70 | Census_FlightRing | 7 | 93.6380 |
| 45 | Census_HasOpticalDiskDrive | 2 | 92.4150 |
| 55 | Census_OSArchitecture | 3 | 90.9360 |
| 19 | Processor | 3 | 90.9320 |
| 39 | Census_ProcessorManufacturerIdentifier | 4 | 88.2511 |
| 66 | Census_GenuineStateName | 4 | 88.2190 |
| 77 | Census_IsTouchEnabled | 2 | 87.4580 |
| 52 | Census_InternalBatteryType | 22 | 78.3554 |
| 31 | SmartScreen | 11 | 75.1342 |
| 80 | Wdft_IsGamer | 2 | 71.8393 |
| 10 | AVProductsInstalled | 5 | 70.1808 |
| 51 | Census_PowerPlatformRoleName | 9 | 69.2690 |
| 15 | OrganizationIdentifier | 37 | 68.0007 |
| 9 | AVProductStatesIdentifier | 2069 | 65.7153 |
| 43 | Census_PrimaryDiskTypeName | 4 | 64.9045 |
| 34 | Census_MDC2FormFactor | 12 | 64.1180 |
| 22 | OsSuite | 7 | 62.1920 |
| 25 | SkuEdition | 8 | 61.6350 |
| 38 | Census_ProcessorCoreCount | 17 | 61.2817 |
| 47 | Census_ChassisTypeName | 30 | 58.8045 |
| 53 | Census_InternalBatteryNumberOfCharges | 1791 | 58.2057 |
| 3 | AppVersion | 80 | 57.9320 |
| 50 | Census_InternalPrimaryDisplayResolutionVertical | 197 | 55.9912 |
| 41 | Census_ProcessorClass | 3 | 54.5667 |
| 67 | Census_ActivationChannel | 6 | 52.8950 |
| 74 | Census_IsSecureBootEnabled | 2 | 51.5010 |
| 49 | Census_InternalPrimaryDisplayResolutionHorizontal | 170 | 50.8339 |
| 82 | HasDetections | 2 | 50.0720 |
| 46 | Census_TotalPhysicalRAM | 172 | 46.4024 |
| 56 | Census_OSBranch | 15 | 44.9990 |
| 57 | Census_OSBuildNumber | 38 | 44.9950 |
| 64 | Census_OSWUAutoUpdateOptionsName | 6 | 44.1270 |
| 21 | OsBuild | 43 | 43.9550 |
| 23 | OsPlatformSubRelease | 9 | 43.9550 |
| 30 | IeVerIdentifier | 151 | 43.9030 |
| 2 | EngineVersion | 42 | 43.1870 |
| 24 | OsBuildLab | 369 | 41.0554 |
| 59 | Census_OSEdition | 19 | 38.7270 |
| 60 | Census_OSSkuName | 17 | 38.7250 |
| 62 | Census_OSInstallLanguageIdentifier | 39 | 35.9119 |
| 63 | Census_OSUILocaleIdentifier | 63 | 35.5940 |
| 48 | Census_InternalPrimaryDiagonalDisplaySizeInInches | 369 | 34.2545 |
| 42 | Census_PrimaryDiskTotalCapacity | 480 | 31.8463 |
| 72 | Census_FirmwareManufacturerIdentifier | 172 | 31.1065 |
| 61 | Census_OSInstallTypeName | 9 | 29.3290 |
| 17 | LocaleEnglishNameIdentifier | 186 | 23.4290 |
| 81 | Wdft_RegionIdentifier | 15 | 20.9235 |
| 16 | GeoNameIdentifier | 242 | 17.2002 |
| 58 | Census_OSBuildRevision | 219 | 15.7100 |
| 54 | Census_OSVersion | 260 | 15.7090 |
| 36 | Census_OEMNameIdentifier | 860 | 14.5190 |
| 8 | DefaultBrowsersIdentifier | 288 | 10.9202 |
| 13 | CountryIdentifier | 221 | 4.4190 |
| 37 | Census_OEMModelIdentifier | 16265 | 3.4587 |
| 40 | Census_ProcessorModelIdentifier | 1738 | 3.2817 |
| 4 | AvSigVersion | 4027 | 1.1400 |
| 14 | CityIdentifier | 15812 | 1.1153 |
| 73 | Census_FirmwareVersionIdentifier | 12456 | 1.0340 |
| 44 | Census_SystemVolumeTotalCapacity | 45619 | 0.5524 |
| 0 | MachineIdentifier | 100000 | 0.0010 |
Need to drop these categorical columns which have 98% skewness.
sns.countplot(x='Firewall', hue='HasDetections',data=train)
plt.show()
Interesting point, having Firewell doesn't matter much on if your device will be infected or not. Or maybe Firewall is not that effective.
droppable_features.extend(sk_df[sk_df.skewness > 99].column.tolist())
droppable_features
['PuaMode', 'Census_ProcessorClass', 'IsBeta', 'PuaMode', 'Census_IsWIMBootEnabled', 'Census_IsFlightingInternal', 'Census_IsFlightsDisabled', 'AutoSampleOptIn', 'SMode', 'Census_ThresholdOptIn', 'Census_IsPortableOperatingSystem', 'Census_DeviceFamily', 'UacLuaenable', 'Census_IsVirtualDevice']
pd.options.display.max_rows = 99
train.Census_InternalBatteryType.value_counts()
Census_InternalBatteryType lion 22535 li-i 2814 # 2028 lip 689 liio 371 li p 90 li 66 nimh 49 real 35 pbac 32 vbox 15 bq20 14 unkn 5 lipp 5 lgi0 3 lhp0 2 ithi 2 lio 1 batt 1 lipo 1 4cel 1 ram 1 Name: count, dtype: int64
trans_dict = {
'?˙˙˙': 'unknown', 'unkn': 'unknown', np.nan: 'unknown'
}
train.replace({'Census_InternalBatteryType': trans_dict}, inplace=True)
train = train.drop(columns=droppable_features)
test = test.drop(columns=droppable_features)
train.shape, test.shape
((100000, 70), (100000, 69))
Conclusion¶
Upon comprehensive analysis, it can be deduced that no single feature is most dominant for this dataset. Multiple features combined together makes up if a device is vulnerable or not. One of the most interesting thing is User handling the device is most important and there's not much provided about them in the data. We know that even an older device which can be susceptible to malwares will be safe in the hands of an experienced user and a newer device with all the top notch security won't make itself secure in the hands of an amateur. Having multiple antiviruses won't make your system secure rather being precauticious at times would be much more helpful. Antiviruses, Windows Defenders, Firewall etcs are to assist user but there is no cure for Human Stupidity as said by famous hacker Kevin Mitnick.
This dataset still requires lot of Exploration and deep dive but that would be too time consuming for this short project.
Limitations: This project doesn't apply any machine learning or deep learning based models. No predictions were made to verify that the columns dropped were actually helpful in creasing the efficiency of the model or were they actually useless. Model based Feature Importance would be interesting to see. This project doesn't process the whole data because of the compute restrictions.